ABSTRACT

The stability of selected indices for detecting differential item performance (item bias) from one randomly equivalent sample to another is addressed. Some recent research has criticized these indices as too unreliable for utility in measuring bias in achievement test items. Using data from a national testing of the ACT Assessment, however, this study suggests that the reliability of the indices is situation-specific. Bias detection indices may be viewed as most reliable in testing situations that involve large sample sizes and some item heterogeneity. A preference is also stated for assessing reliability based on signed rather than unsigned indices.

THE RELIABILITY OF MEASURING DIFFERENTIAL ITEM PERFORMANCE


Statistical procedures for detecting differential item performance, or item bias, have been studied extensively during the past ten years (see Rudner, Getson, & Knight, 1980a; Shepard, 1981). Most of this research has addressed the very important, but elusive, issue of validity for the various procedures. The reliability of item bias procedures, however, has not been thoroughly examined. After a review of the literature, Ironson (1982) suggested that more information on the reliability of various bias indices was needed.

Kolen and Hoover (1982) specifically addressed the reliability issue in their recent work. Using item responses of fifth grade students on the Iowa Tests of Basic Skills (ITBS), they found six "unsigned" procedures to have very little reliability (stability from one randomly equivalent group to another) for detecting bias with sample sizes of 200 per group. Median reliabilities across the subtests of the ITBS ranged from .06 to .24 for the various indices. Their highest reliabilities, most between .2 and .5, were found with language arts subtests.

The low reliabilities found by Kolen and Hoover, however, do not suggest that bias indices are universally unreliable. Certainly factors such as item type, sampled population, utilized sample sizes, and type of index (signed or unsigned) may have contributed to their results. The ITBS is a well-edited battery of achievement tests, closely tied to the basic academic skills for each grade level. Conceivably, tests that are not as closely tied to specific curricula might consist of more heterogeneous items that could lead to more instances of differential item performance, greater levels of reliability and, consequently, greater potential utility for bias indices.

Sample size, too, can have a considerable effect on observed reliabilities. In their conclusion, Kolen and Hoover suggested that if bias indices are to be useful, research is needed to determine the sample sizes necessary for stable results.

Another consideration is whether signed or unsigned item bias indices (Ironson & Subkoviak, 1979) are used. Kolen and Hoover emphasized the unsigned versions in their research, "since item screening, as usually conceived, involves eliminating items biased against any group" (p. 3). This is not an illogical position from a test development standpoint. However, unsigned bias statistics do not take advantage of all available information (the direction of the "bias") as do signed indices. The result is unnecessarily low estimates of reliability.
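
The loss of reliability from discarding the sign can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the procedure used in the study: each item is given a hypothetical normally distributed "true" index value, two replications add independent normal noise, and the function names (`pearson`, `simulate`) are illustrative.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_items=75, noise_sd=1.0, seed=0):
    """Return (signed, unsigned) split-sample reliabilities.

    Assumption: true index values and sampling noise are both normal;
    real bias indices need not behave this way.
    """
    rng = random.Random(seed)
    true_bias = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    # Two randomly equivalent replications of the same index.
    sample_a = [t + rng.gauss(0.0, noise_sd) for t in true_bias]
    sample_b = [t + rng.gauss(0.0, noise_sd) for t in true_bias]
    signed = pearson(sample_a, sample_b)
    # Unsigned version: keep the magnitude, discard the direction.
    unsigned = pearson([abs(v) for v in sample_a],
                       [abs(v) for v in sample_b])
    return signed, unsigned


if __name__ == "__main__":
    signed, unsigned = simulate(n_items=5000)
    print(f"signed reliability:   {signed:.2f}")
    print(f"unsigned reliability: {unsigned:.2f}")
```

With many simulated items the signed correlation settles near the ratio of true to total variance, while the correlation of the absolute values is markedly lower, which is the pattern the argument above anticipates.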



Objectives

The major objective of this research was to examine the issue of reliable detection of item bias, as it applies to race, using:

1. a professionally developed test, but one less closely tied to specific curricula (and more heterogeneous) than the ITBS;

2. varied sample sizes of 200 and 1000 in each group; and

3. both signed and unsigned versions of six bias indices.

A supplementary objective was to examine intercorrelations among the various indices.

Techniques

Six indices of item bias were evaluated. All of these statistics rely on internal analyses of the test to identify deviant items.

Item Difficulty Index (TID). This index is the result of the transformed item difficulties procedure (Angoff & Ford, 1973). The TID approach is based on the relative difficulty of an item for each of two groups, controlling for total test score. Items are considered biased if they are relatively more difficult for one group than for another.

Point Biserial Index (PBIS). This index is represented by the difference between the point biserial correlations, for each group, of the item with total score. It is particularly sensitive to relative group differences in item discrimination.
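
The two classical indices above might be computed along the following lines. This is a hedged sketch, not the authors' code: it assumes the common delta-plot form of the transformed item difficulty procedure (delta = 13 - 4z, with the index taken as the signed perpendicular distance from the major axis of the plot), and the function names are illustrative.

```python
from statistics import NormalDist, mean


def _pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def delta(p):
    """Proportion correct (0 < p < 1) on the delta scale; larger = harder."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)


def tid_index(p_group1, p_group2):
    """Signed distance of each item from the major axis of the delta plot.

    Positive values mean the item is relatively easier for group 2 than
    the major axis predicts. Assumes the two sets of deltas covary (sxy != 0).
    """
    x = [delta(p) for p in p_group1]
    y = [delta(p) for p in p_group2]
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = (syy - sxx + ((syy - sxx) ** 2 + 4 * sxy ** 2) ** 0.5) / (2 * sxy)
    intercept = my - slope * mx
    norm = (slope ** 2 + 1) ** 0.5
    return [(slope * a + intercept - b) / norm for a, b in zip(x, y)]


def point_biserial(item, total):
    """Correlation between a 0/1 item score and the total score."""
    return _pearson(item, total)


def pbis_index(item1, total1, item2, total2):
    """Difference between the two groups' point biserial correlations."""
    return point_biserial(item1, total1) - point_biserial(item2, total2)
```

The unsigned version of either index is simply its absolute value.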

Item-Group Partial Correlation (IGP). The IGP index (Stricker, 198?) is the correlation of the item, scored right or wrong, and group membership, controlling for total score. This index was developed as a readily interpretable measure of differential item performance.
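
A first-order partial correlation of this kind can be written directly from the three pairwise correlations. The sketch below uses the standard textbook formula; the function names and the 0/1 coding of group membership are assumptions made for illustration.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


def igp_index(item, group, total):
    """Item-group partial correlation, controlling for total score.

    item:  0/1 item scores (wrong/right)
    group: 0/1 group membership (coding is an assumption)
    total: total test scores
    """
    r_item_group = pearson(item, group)
    r_item_total = pearson(item, total)
    r_group_total = pearson(group, total)
    return partial_corr(r_item_group, r_item_total, r_group_total)
```

The sign of the result depends on how the groups are coded, so the coding has to be held fixed across replications when signed reliabilities are compared.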

Modified Chi-Square Index (MOD CHI). This index is an approximate chi-square index (Scheuneman, 1979) that is based on a contingency table of correct item responses, corresponding to two groups and some finite number of score intervals (4 were used in this study). The use of matched score intervals roughly serves to equate the two groups, within an interval, on total test performance. The modified chi-square index is sensitive to group differences in both item difficulty and item discrimination.

Chi-Square Index (CHI SQR). The full chi-square index (Shepard, Camilli, & Averill, 1980) is an extension of Scheuneman's contingency table analysis of correct responses to include a similar analysis of incorrect responses. This approach is also expected to be sensitive to group differences in both item difficulty and discrimination.
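
The relationship between the two chi-square indices can be sketched for a single item as follows. This is an illustrative reading of the descriptions above rather than the published computing formulas: expected counts are taken from the pooled proportion correct within each score interval, and the function and argument names are hypothetical.

```python
def chi_square_indices(correct, n):
    """Return (modified, full) chi-square values for one item.

    correct[g][j]: number of correct responses in group g, score interval j
    n[g][j]:       number of examinees in group g, score interval j
    Assumes every interval has some correct and some incorrect responses,
    so that no expected count is zero.
    """
    modified = 0.0
    full = 0.0
    for j in range(len(n[0])):
        n_j = n[0][j] + n[1][j]
        # Pooled proportion correct in this score interval.
        p_j = (correct[0][j] + correct[1][j]) / n_j
        for g in (0, 1):
            exp_correct = n[g][j] * p_j
            exp_incorrect = n[g][j] * (1 - p_j)
            obs_correct = correct[g][j]
            obs_incorrect = n[g][j] - obs_correct
            term_correct = (obs_correct - exp_correct) ** 2 / exp_correct
            term_incorrect = (obs_incorrect - exp_incorrect) ** 2 / exp_incorrect
            modified += term_correct                 # correct responses only
            full += term_correct + term_incorrect    # correct and incorrect
    return modified, full
```

For example, with one interval of 20 examinees per group and 15 versus 5 correct responses, the sketch gives 5.0 for the modified index and 10.0 for the full index.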


3-Parameter Index (L&H). This index was proposed by Linn and Harnisch (1981) as a small sample alternative to existing 3-parameter indices that require larger sample sizes. To calculate the index, the item and ability parameters of the 3-parameter item response theory model are estimated for the total sample. The two groups are then separated. The difference is taken between each examinee's probability of correctly answering the item and the examinee's actual response to the item (1 = correct, 0 = incorrect). This difference is then standardized and averaged over the examinees in each group. The index is the sum of the mean values for each group (Kolen & Hoover, 1982).
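
The per-group averaging step can be sketched as below, taking the item and ability parameters as already estimated from the total sample. Several details are assumptions made for illustration: the logistic scaling constant of 1.7, the direction of the difference (response minus probability), and standardization by the binomial standard deviation.

```python
import math


def p_3pl(theta, a, b, c, scale=1.7):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1 - c) / (1 + math.exp(-scale * a * (theta - b)))


def mean_std_residual(responses, thetas, a, b, c):
    """Mean standardized residual for one item within one group.

    responses: 0/1 item scores of the group's examinees
    thetas:    their ability estimates from the total-sample calibration
    a, b, c:   the item's discrimination, difficulty, and lower asymptote
    """
    residuals = []
    for u, theta in zip(responses, thetas):
        p = p_3pl(theta, a, b, c)
        # Assumed form: (observed - expected) / binomial standard deviation.
        residuals.append((u - p) / math.sqrt(p * (1 - p)))
    return sum(residuals) / len(residuals)
```

The function is applied once per group; the two group means are then combined into a single index as described in the text.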



Methodology

The data consisted of item responses on the 75-item English Usage subtest of the ACT Assessment (ACTE)¹ by 4000 college-bound high school students in April, 1980. The total sample included 2000 randomly selected black (62.3% female) and 2000 randomly selected white (54.3% female) students. Mean raw score performance was 41.6 for the whites and 28.3 for the blacks. The standard deviation was 10.7 for each group. The initial and replication samples of 400 and 2000 cases each were randomly selected without replacement from this pool of students.



¹ The ACT Assessment is an achievement test directly related to high school instruction. However, since its focus is on the diverse curricula taught in high schools, it is thought to be less closely tied to curricula than tests, such as the ITBS, aimed specifically at achievement in the basic skills.



Item bias indices were calculated for each subsample. The reliability of each index was then indicated by the correlation between values of the index for the two samples of 200 black and 200 white students and for the two samples of 1000 black and 1000 white students. Since signed and unsigned versions of the indices were investigated, each index has four reliabilities associated with it: the signed and unsigned versions for the 400-case samples and the signed and unsigned versions for the 2000-case samples.

An additional approach to index reliability was performed, based only on the specific items that were identified as most biased. This approach was useful because it provided a measure of the practical reliability of each procedure for identifying deviant items. The ten most deviant items for each sample, as determined by each procedure, were identified (the ten with the greatest absolute magnitude of the index). Unweighted Kappa coefficients (Cohen, 1960) were then calculated for each procedure and each pair of samples. Since items were identified on the basis of the absolute magnitude of the indices, this approach to reliability is closely related to the unsigned correlational reliabilities.
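
Because both samples flag the same number of items, the Kappa coefficient here depends only on how many flagged items the two sets share. The sketch below makes that explicit; the function names are illustrative, and the 2 x 2 (flagged / not flagged) layout is an assumption about how the agreement table was formed.

```python
def most_deviant(values, k=10):
    """Indices of the k items with the greatest absolute index value."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


def kappa_from_overlap(n_items, n_selected, n_common):
    """Unweighted Cohen's kappa for two equal-sized sets of flagged items."""
    both = n_common
    only_one = n_selected - n_common          # flagged by one sample but not the other
    neither = n_items - both - 2 * only_one
    p_observed = (both + neither) / n_items
    p_flag = n_selected / n_items
    # Chance agreement when each sample flags the same proportion of items.
    p_expected = p_flag ** 2 + (1 - p_flag) ** 2
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    for common in (8, 6, 5, 4):
        print(common, round(kappa_from_overlap(75, 10, common), 2))
```

With 75 items and ten flagged per sample, 8, 6, 5, and 4 common items give coefficients of .77, .54, .42, and .31 respectively, matching the corresponding entries reported in Table 2.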

As with reliability, the interrelationships of the indices were examined in two ways: first, by the intercorrelations of the signed indices, obtained for one of the 2000-case samples; and second, by Kappa coefficients of agreement between items selected as most deviant by each index.

Results

The reliabilities of the small and the large sample bias indices are shown in Table 1. As expected, in every case the reliabilities were higher for the larger than for the smaller samples. Also, as expected, the reliabilities for all the signed indices were higher than for their unsigned counterparts. Regardless of whether the signed or unsigned versions of the bias indices are compared, though, the TID approach seemed to be the most reliable. However, as Hunter (1975) has pointed out, this index is spuriously sensitive to group differences in performance. The relatively high degree of reliability for this procedure may be an artifact of the substantial performance difference between blacks and whites on the items (Kolen & Hoover, 1982). The remaining indices seemed to be about equal in reliability.

Table 2 presents the Kappa coefficients and the number of deviant items selected in common, between samples, by each index. It should be noted that separate consideration of signed and unsigned versions of the indices is not presented here, because the results would be the same. That is, the same sets of items, selected on the basis of the absolute magnitude of the index, would result. However, this

TABLE 1

Reliabilities of Signed and Unsigned Bias Indices*

              Small Samples           Large Samples
                (N = 400)               (N = 2000)

Index        Signed   Unsigned       Signed   Unsigned

TID           .72       .39           .91       .80
PBIS          .59       .48           .66       .58
IGP           .60       .31           .74       .50
MOD CHI       .62       .47           .77       .69
CHI SQR       .58       .32           .76       .66
L&H           .61       .22           .79       .51

* All values in the table are statistically significant (p < .001).



analysis is most closely related to the reliability estimation of the unsigned versions of the indices. At least for the large samples, the TID procedure again seemed to produce the most reliable results. Eight of the ten deviant items (80%) identified by the TID procedure using the first of the large samples were also identified using the second large sample. About 50 percent agreement between samples was evident for the other bias measures.

From a pure measurement perspective, the signed bias indices are preferred to the unsigned versions since they reflect not only magnitude but directionality as well. A better understanding of the relationships between the bias indices is thus found by investigating the signed versions. Table 3 shows the intercorrelations among the signed bias indices for one of the 2000-case samples. Table 4 presents

TABLE 2

Consistency of Deviant Item Selection

              Small Samples              Large Samples
                (N = 400)                  (N = 2000)

Index        Kappa   Common items*      Kappa   Common items

TID           .31         4              .77         8
PBIS          .54         6              .54         6
IGP           .14         3              .31         4
MOD CHI       .31         4              .42         5
CHI SQR       .31         4              .42         5
L&H           .14         3              .31         4

* The number of items that are common to each sample's set of ten most deviant items.



the same intercorrelation matrix after correcting for attenuation. The results indicate a great deal of similarity among the IGP, the modified Chi-square, the full Chi-square, and the Linn & Harnisch measures. The TID procedure seems to be moderately related to these procedures, while the Point Biserial approach seems to stand alone. These results are consistent with expectations. The Point Biserial index is the only measure to emphasize group differences in item discrimination and it clearly does not correlate positively with the other procedures. The TID index emphasizes only group differences in item difficulty and it, too, seems to stand at least somewhat apart from the others. The remaining four indices are sensitive to group differences in item difficulty and discrimination, and they seem to produce very similar results.
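
The correction for attenuation referred to above is conventionally the observed correlation divided by the square root of the product of the two reliabilities. The sketch below shows that standard formula; the observed correlation of .60 in the usage note is a hypothetical value chosen for illustration, not a figure from the study.

```python
def disattenuate(r_xy, rel_x, rel_y):
    """Correlation between x and y corrected for unreliability in both.

    r_xy:  observed correlation between the two indices
    rel_x: reliability of the first index
    rel_y: reliability of the second index
    """
    return r_xy / (rel_x * rel_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical observed correlation of .60 between two indices whose
    # large-sample signed reliabilities are .91 and .74 (as for TID and IGP).
    print(round(disattenuate(0.60, 0.91, 0.74), 2))
```

Because the reliabilities are themselves estimates, corrected values can reach or slightly exceed 1.00 when two indices are nearly interchangeable.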

TABLE 3

Intercorrelations of Signed Bias Indices*

[Correlation matrix of the six indices (TID, PBIS, IGP, MOD CHI, CHI SQR, L&H). Apart from diagonal entries of 1.00 and a single value of .94 whose position is unclear, the entries are not legible in the source.]

* Based on one of the 2000-case samples and the 75 ACTE items.



TABLE 4

Intercorrelations of Unattenuated Signed Bias Indices

[Correlation matrix of the six indices (TID, PBIS, IGP, MOD CHI, CHI SQR, L&H). Apart from diagonal entries of 1.00 and two values, .92 and .99, whose positions are unclear, the entries are not legible in the source.]

Table 5 presents measures of agreement in selection of deviant items between the different bias indices. As shown previously, in Table 4, the TID and Point Biserial indices tend to stand apart, whereas the other four indices seem to be more closely related. To illustrate the commonality between the IGP, the modified Chi-square, the full Chi-square, and the Linn & Harnisch indices, four items were identified as being among the ten most deviant items by all four procedures. Another four items were identified by three of these four procedures, using one of the 2000-case samples.



TABLE 5

Measures of Agreement in Deviant Item Selection*

             TID      PBIS      IGP     MOD CHI   CHI SQR     L&H

TID         1.00    -.15(0)   .31(4)   -.04(1)    .19(3)    .08(2)
PBIS                 1.00    -.04(1)    .08(2)   -.04(1)    .08(2)
IGP                           1.00      .54(6)    .42(5)    .77(8)
MOD CHI                                 1.00      .54(6)    .77(8)
CHI SQR                                           1.00      .54(6)
L&H                                                         1.00

* The first number for each combination is the Kappa coefficient of agreement. The values in parentheses are the numbers of items common to the set of 10 most deviant items selected by each procedure, using one of the 2000-case samples.



Discussion

The results demonstrate that testing situations do exist in which bias indices can reliably detect differential item performance. Like most phenomena in the social sciences, though, the reliability of bias indices seems to be situation-specific. Kolen and Hoover (1982) effectively argued that statistical bias detection procedures were not very reliable and, consequently, not very useful within the current test development process of the ITBS. However, with more heterogeneous tests, such as the ACT Assessment, and with larger sample sizes, commonly investigated bias indices can attain potentially useful levels of reliability.



The manner in which the bias indices are to be used is also important. If they are used in the process of learning more about the functioning of various items or item types for different groups, both the degree and directionality of bias are important. The relevant reliability analysis for this use of item bias indices is correlational, as shown in Table 1. Particularly with the larger samples, but also with the 400-case samples, the signed indices seem to be reasonably reliable for this purpose.


If bias indices are used as screening devices in the test development process, the relevant analysis is the stability of item classification. Although Table 2 indicates some commonality, these data do not suggest that the bias indices can be relied on to the exclusion of expert editorial review. In fact, with some curriculum-bound tests and a test development process that includes several stages of thorough editorial review, these indices may be relatively useless (Kolen & Hoover, 1982). The indices seem to offer more promise, however, when used with more heterogeneous tests and when used as a tool to screen items for more extensive editorial review. Neither common sense nor the results of this study suggest that bias indices should supersede expert judgment on the desirability of an item within a test.

Future efforts in studying the reliability of item bias procedures might focus on other indices, or the combination of two or more indices. The investigation of procedures stemming from latent trait theory (Ironson, 1982; Lord, 1980), for instance, would certainly be in order. The joint reliability of two or more relatively independent and easily computed indices (such as the TID and PBIS) might also be useful.

Finally, Monte Carlo studies could be very useful in the systematic exploration of index reliability. Use of simulated data, a la Rudner, Getson, and Knight (1980b) in validity research, might help clarify the effects of different types of tests and examinee populations on the reliability of statistical item bias procedures.